    Error, reproducibility and sensitivity : a pipeline for data processing of Agilent oligonucleotide expression arrays

    Background: Expression microarrays are increasingly used to obtain large-scale transcriptomic information on a wide range of biological samples. Nevertheless, there is still much debate on the best ways to process data, design experiments and analyse the output. Furthermore, many of the more sophisticated mathematical approaches to data analysis in the literature remain inaccessible to much of the biological research community. In this study we examine ways of extracting and analysing a large data set obtained using the Agilent long oligonucleotide transcriptomics platform, applied to a set of human macrophage and dendritic cell samples. Results: We describe and validate a series of data extraction, transformation and normalisation steps which are implemented via a new R function. Analysis of replicate normalised reference data demonstrates that intra-array variability is small (only around 2% of the mean log signal), while inter-array variability from replicate array measurements has a standard deviation (SD) of around 0.5 log2 units (around 6% of the mean). The common practice of working with ratios of Cy5/Cy3 signal offers little further improvement in terms of reducing error. Comparison to expression data obtained using Arabidopsis samples demonstrates that the large number of genes in each sample showing a low level of transcription reflects the real complexity of the cellular transcriptome. Multidimensional scaling is used to show that the processed data identify an underlying structure which reflects some of the key biological variables defining the data set. This structure is robust, allowing reliable comparison of samples collected over a number of years by a variety of operators. Conclusions: This study outlines a robust and easily implemented pipeline for extracting, transforming, normalising and visualising transcriptomic array data from the Agilent expression platform. The analysis is used to obtain quantitative estimates of the SD arising from experimental (non-biological) intra- and inter-array variability, and a lower threshold for determining whether an individual gene is expressed. The study provides a reliable basis for further, more extensive studies of the systems biology of eukaryotic cells.
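
    A minimal sketch of the kind of pipeline this abstract describes: log transformation, a simple between-array normalisation, and multidimensional scaling of the samples. The function names and the median-centring step are illustrative assumptions, not the authors' actual R implementation.

```python
# Sketch of an expression-array processing pipeline (assumed steps, not the
# paper's R function): log2-transform, median-centre arrays, project by MDS.
import numpy as np
from sklearn.manifold import MDS

def log_transform(raw):
    """Log2-transform raw intensities, clipping at 1 to avoid log(0)."""
    return np.log2(np.clip(raw, 1, None))

def median_centre(log_signal):
    """Normalise between arrays by subtracting each array's median signal."""
    return log_signal - np.median(log_signal, axis=1, keepdims=True)

def mds_coords(log_signal, n_components=2):
    """Place samples in 2D by MDS on pairwise Euclidean distances."""
    diffs = log_signal[:, None, :] - log_signal[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    mds = MDS(n_components=n_components, dissimilarity="precomputed",
              random_state=0)
    return mds.fit_transform(dist)

raw = np.random.lognormal(mean=6, sigma=2, size=(12, 2000))  # fake arrays
coords = mds_coords(median_centre(log_transform(raw)))
print(coords.shape)  # (12, 2): one point per sample for plotting
```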

    Optimality Driven Nearest Centroid Classification from Genomic Data

    Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest-centroid classifiers that is instead based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest-centroid classifiers.
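
    For orientation, here is a basic nearest-centroid classifier with univariate feature scoring and soft-thresholded (shrunken) centroids. This is an illustrative baseline of the techniques the abstract contrasts itself with, not the paper's optimality-driven greedy algorithm; all names are assumptions.

```python
# Baseline nearest-centroid classification with univariate feature selection
# and optional centroid shrinkage (illustrative, not the paper's method).
import numpy as np

def select_features(X, y, k):
    """Rank features by a two-class t-like score and keep the top k."""
    a, b = X[y == 0], X[y == 1]
    score = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-8)
    return np.argsort(score)[::-1][:k]

def fit_centroids(X, y, shrink=0.0):
    """Class centroids, optionally shrunk towards the overall centroid."""
    overall = X.mean(0)
    cents = []
    for c in np.unique(y):
        d = X[y == c].mean(0) - overall
        d = np.sign(d) * np.maximum(np.abs(d) - shrink, 0)  # soft threshold
        cents.append(overall + d)
    return np.vstack(cents)

def predict(X, centroids):
    """Assign each sample to the class of the nearest centroid."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(2)
    return d.argmin(1)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(30, 500)), rng.integers(0, 2, 30)  # toy data
keep = select_features(X, y, 20)
cents = fit_centroids(X[:, keep], y, shrink=0.1)
print(predict(X[:, keep], cents)[:5])
```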

    A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods

    Background: The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws. Results: We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison. Conclusion: We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.
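
    The core of a spike-in benchmark of this kind is scoring how well a differential expression statistic ranks truly spiked-in genes above unchanged ones. A hedged sketch of that evaluation step, with simulated placeholders standing in for the Golden Spike truth and for any particular method's statistic:

```python
# Score a DE method's gene ranking against known spike-in truth via ROC AUC
# (the general idea behind a benchmark such as AffyDEComp; data are toy).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
truth = rng.random(1000) < 0.1                       # truly spiked-in genes
stat = np.abs(rng.normal(size=1000)) + 2.0 * truth   # toy DE statistic
print("AUC:", roc_auc_score(truth, stat))            # higher = better ranking
```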

    An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++

    Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates, and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
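
    A minimal sketch of the subject-level bootstrapping idea: resample whole subjects (with all of their replicates) rather than individual replicate observations, so replicates of one subject never straddle the in-bag/out-of-bag split. This illustrates the resampling concept behind RF++, not its C++ implementation; variable names are assumptions.

```python
# Subject-level bootstrap for cluster-correlated data: draw subjects with
# replacement, then include every replicate row of each drawn subject.
import numpy as np

def subject_level_bootstrap(subject_ids, rng):
    """Return row indices for one bootstrap sample drawn at the subject level."""
    subjects = np.unique(subject_ids)
    drawn = rng.choice(subjects, size=len(subjects), replace=True)
    rows = [np.flatnonzero(subject_ids == s) for s in drawn]
    return np.concatenate(rows)

rng = np.random.default_rng(42)
ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 3])  # 4 subjects with replicates
print(subject_level_bootstrap(ids, rng))     # rows grouped by drawn subject
```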

    Blood Signature of Pre-Heart Failure: A Microarrays Study

    BACKGROUND: The preclinical stage of systolic heart failure (HF), known as asymptomatic left ventricular dysfunction (ALVD), is diagnosed only by echocardiography, is frequent in the general population, and leads to a high risk of developing severe HF. Large-scale screening for ALVD is a difficult task and represents a major unmet clinical challenge that requires the determination of ALVD biomarkers. METHODOLOGY/PRINCIPAL FINDINGS: 294 individuals were screened by echocardiography. We identified 9 ALVD cases out of 128 subjects with cardiovascular risk factors. White blood cell gene expression profiling was performed using pangenomic microarrays. Data were analyzed using principal component analysis (PCA) and Significance Analysis of Microarrays (SAM). To build an ALVD classifier model, we used the nearest centroid classification method (NCCM) with the ClaNC software package. Classification performance was determined using the leave-one-out cross-validation method. Blood transcriptome analysis provided a specific molecular signature for ALVD which defined a model based on 7 genes capable of discriminating ALVD cases. Analysis of an ALVD patient validation group demonstrated that these genes are accurate diagnostic predictors for ALVD, with 87% accuracy and 100% precision. Furthermore, Receiver Operating Characteristic curves of expression levels confirmed that 6 out of 7 genes discriminate for left ventricular dysfunction classification. CONCLUSIONS/SIGNIFICANCE: These targets could serve to enhance the ability of general care practitioners to efficiently detect ALVD, facilitating preemptive initiation of medical treatment preventing the development of HF.
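
    The validation scheme described here (nearest centroid classification scored by leave-one-out cross-validation) can be sketched with scikit-learn stand-ins in place of the ClaNC R package named in the abstract; the data below are simulated placeholders, not the study's 7-gene signature.

```python
# Leave-one-out cross-validation of a nearest-centroid classifier
# (scikit-learn stand-in for ClaNC; toy data, not the ALVD cohort).
import numpy as np
from sklearn.neighbors import NearestCentroid
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 7))        # 40 subjects x 7 signature genes (toy)
y = rng.integers(0, 2, size=40)     # ALVD vs control labels (toy)
acc = cross_val_score(NearestCentroid(), X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", acc.mean())
```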

    Reduction of the contaminant fraction of DNA obtained from an ancient giant panda bone

    Objective: A key challenge in ancient DNA research is massive microbial DNA contamination from the deposition site, which accumulates post mortem in the study organism's remains. Two simple and cost-effective methods to enrich the relative endogenous fraction of DNA in ancient samples involve treatment of sample powder with either bleach or Proteinase K pre-digestion prior to DNA extraction. Both approaches have yielded promising but varying results in other studies. Here, we contribute data on the performance of these methods using a comprehensive and systematic series of experiments applied to a single ancient bone fragment from a giant panda (Ailuropoda melanoleuca). Results: Bleach and pre-digestion treatments increased the endogenous DNA content up to ninefold. However, the absolute amount of DNA retrieved was dramatically reduced by all treatments. We also observed reduced DNA damage patterns in pre-treated libraries compared to untreated ones, resulting in longer mean fragment lengths and reduced thymine over-representation at fragment ends. Guanine–cytosine (GC) contents of both mapped and total reads are consistent between treatments and conform to general expectations, indicating no obvious biasing effect of the applied methods. Our results therefore confirm the value of bleach and pre-digestion as tools in palaeogenomic studies, provided sufficient material is available.
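
    Two of the summary statistics this abstract relies on are easy to state concretely: the endogenous fraction (reads mapping to the target genome over total reads) and per-read GC content. A small sketch with invented placeholder numbers:

```python
# Summary statistics used when comparing ancient-DNA library treatments
# (counts and sequences below are invented placeholders).
def endogenous_fraction(mapped_reads, total_reads):
    """Fraction of sequenced reads mapping to the target genome."""
    return mapped_reads / total_reads

def gc_content(seq):
    """Proportion of G and C bases in a read."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

print(endogenous_fraction(9_000, 120_000))  # 0.075, i.e. 7.5% endogenous
print(gc_content("ATGGCCTA"))               # 0.5
```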

    A Multi-Cancer Mesenchymal Transition Gene Expression Signature Is Associated with Prolonged Time to Recurrence in Glioblastoma

    A stage-associated gene expression signature of coordinately expressed genes, including the transcription factor Slug (SNAI2) and other epithelial-mesenchymal transition (EMT) markers, has been found present in samples from publicly available gene expression datasets in multiple cancer types, including nonepithelial cancers. The expression levels of the co-expressed genes vary in a continuous and coordinate manner across the samples, ranging from absence of expression to strong co-expression of all genes. These data suggest that tumor cells may pass through an EMT-like process of mesenchymal transition to varying degrees. Here we show that, in glioblastoma multiforme (GBM), this signature is associated with time to recurrence following initial treatment. By analyzing data from The Cancer Genome Atlas (TCGA), we found that GBM patients who responded to therapy and had a long time to recurrence had low levels of the signature in their tumor samples (P = 3×10⁻⁷). We also found that the signature is strongly correlated in gliomas with the putative stem cell marker CD44, and is highly enriched among the differentially expressed genes in glioblastomas vs. lower grade gliomas. Our results suggest that long delay before tumor recurrence is associated with absence of the mesenchymal transition signature, raising the possibility that inhibiting this transition might improve the durability of therapy in glioma patients.
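
    A common way to operationalise a signature like this is to average the standardised expression of the signature genes per sample and test the score's association with the clinical variable. The sketch below uses that generic scoring approach with toy data; it is an assumption for illustration, not the study's TCGA analysis or survival model.

```python
# Generic gene-signature scoring: mean of per-gene z-scores over the
# signature set, then a rank correlation with time to recurrence (toy data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
expr = rng.normal(size=(100, 50))          # samples x genes (toy)
sig_idx = np.arange(10)                    # columns of the signature genes
z = (expr - expr.mean(0)) / expr.std(0)    # standardise each gene
score = z[:, sig_idx].mean(1)              # per-sample signature score
time_to_recurrence = rng.exponential(300, size=100)
print(spearmanr(score, time_to_recurrence))
```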

    Pre-processing Agilent microarray data

    Background: Pre-processing methods for two-sample long oligonucleotide arrays, specifically the Agilent technology, have not been extensively studied. The goal of this study is to quantify some of the sources of error that affect measurement of expression using Agilent arrays and to compare Agilent's Feature Extraction software with pre-processing methods that have become the standard for normalization of cDNA arrays. These include log transformation followed by loess normalization with or without background subtraction, and often a between-array scale normalization procedure. The larger goal is to define best study design and pre-processing practices for Agilent arrays, and we offer some suggestions. Results: Simple loess normalization without background subtraction produced the lowest variability. However, without background subtraction, fold changes were biased towards zero, particularly at low intensities. ROC analysis of a spike-in experiment showed that differentially expressed genes are most reliably detected when background is not subtracted. Loess normalization and no background subtraction yielded an AUC of 99.7%, compared with 88.8% for Agilent processed fold changes. All methods performed well when error was taken into account by t- or z-statistics (AUCs ≥ 99.8%). A substantial proportion of genes showed dye effects, 43% (99% CI: 39%, 47%). However, these effects were generally small regardless of the pre-processing method. Conclusion: Simple loess normalization without background subtraction resulted in low-variance fold changes that more reliably ranked gene expression than the other methods. While t-statistics and other measures that take variation into account, including Agilent's z-statistic, can also be used to reliably select differentially expressed genes, fold changes are a standard measure of differential expression for exploratory work, cross-platform comparison and biological interpretation, and cannot be entirely replaced. Although dye effects are small for most genes, many array features are affected. Therefore, an experimental design that incorporates dye swaps or a common reference could be valuable.
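
    A minimal sketch of the recommended step, loess normalization of two-channel log-ratios with no background subtraction, using statsmodels' lowess as a stand-in for the loess fit. The simulated intensities and the smoothing fraction are assumptions for illustration.

```python
# MA-plot loess normalisation for two-channel arrays: remove the
# intensity-dependent dye bias from M = log2(R) - log2(G).
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_normalise(log_r, log_g, frac=0.3):
    """Subtract a lowess fit of M on A from the per-probe log-ratios."""
    M = log_r - log_g                  # log-ratio per probe
    A = 0.5 * (log_r + log_g)          # mean log-intensity per probe
    fit = lowess(M, A, frac=frac, return_sorted=False)
    return M - fit                     # normalised log-ratios

rng = np.random.default_rng(3)
g = rng.normal(8, 2, 5000)                     # toy Cy3 log-intensities
r = g + 0.1 * g + rng.normal(0, 0.3, 5000)     # Cy5 with intensity bias
print(loess_normalise(r, g).mean())            # ~0 after normalisation
```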

    Benzoxazinoids in Root Exudates of Maize Attract Pseudomonas putida to the Rhizosphere

    Benzoxazinoids, such as 2,4-dihydroxy-7-methoxy-2H-1,4-benzoxazin-3(4H)-one (DIMBOA), are secondary metabolites in grasses. In addition to their function in plant defence against pests and diseases above-ground, benzoxazinoids (BXs) have also been implicated in defence below-ground, where they can exert allelochemical or antimicrobial activities. We have studied the impact of BXs on the interaction between maize and Pseudomonas putida KT2440, a competitive coloniser of the maize rhizosphere with plant-beneficial traits. Chromatographic analyses revealed that DIMBOA is the main BX compound in root exudates of maize. In vitro analysis of DIMBOA stability indicated that KT2440 tolerance of DIMBOA is based on metabolism-dependent breakdown of this BX compound. Transcriptome analysis of DIMBOA-exposed P. putida identified increased transcription of genes controlling benzoate catabolism and chemotaxis. Chemotaxis assays confirmed motility of P. putida towards DIMBOA. Moreover, colonisation assays in soil with Green Fluorescent Protein (GFP)-expressing P. putida showed that DIMBOA-producing roots of wild-type maize attract significantly higher numbers of P. putida cells than roots of the DIMBOA-deficient bx1 mutant. Our results demonstrate a central role for DIMBOA as a below-ground semiochemical for recruitment of plant-beneficial rhizobacteria during the relatively young and vulnerable growth stages of maize.